My IODS project GitHub data repository


1 About the project

Welcome to my IODS 2019 course project page!

This project diary is basically me trying to learn how to do statistical analyses with R. I look forward to improving my R skills and learning some good open data practices.

What do I expect to learn?

  • Open data practices
  • Practice R
  • R Markdown (this is my first time using it)
  • Recap some data science analyses and good scientific practices


Where did I learn about the course?

I think I just browsed the HYMY courses in WebOodi.








2 Regression and model validation

Overview of the Data


The data is from Kimmo Vehkalahti’s study ASSIST 2014, which measured the approaches to learning of 183 university students in the Introduction to Social Statistics course in fall 2014.

## [1] 166   7
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...


The data has 7 variables:

  • Gender: M (Male), F (Female)
  • Age (in years) derived from the date of birth
  • Global attitude toward statistics
  • Deep learning measured with 12 items
  • Strategic learning measured with 8 items
  • Surface learning measured with 12 items
  • Exam points

Composite variables assessing the approaches to learning were formed by combining the items measuring each construct, calculating the item means for each participant. 166 cases were included in the analyses, as 17 cases with an exam score of 0 were omitted.



## # A tibble: 2 x 2
##   gender     n
##   <fct>  <int>
## 1 F        110
## 2 M         56

## [1] 25.51205
## [1] 17 55


There are 110 females and 56 males in the dataset and the ages of the participants range from 17 to 55 years with the average age being 25.5.

##          vars   n  mean   sd median trimmed  mad   min   max range  skew
## age         1 166 25.51 7.77  22.00   23.99 2.97 17.00 55.00 38.00  1.89
## attitude    2 166 31.43 7.30  32.00   31.52 7.41 14.00 50.00 36.00 -0.08
## deep        3 166  3.68 0.55   3.67    3.70 0.62  1.58  4.92  3.33 -0.50
## stra        4 166  3.12 0.77   3.19    3.14 0.83  1.25  5.00  3.75 -0.11
## surf        5 166  2.79 0.53   2.83    2.78 0.62  1.58  4.33  2.75  0.14
## points      6 166 22.72 5.89  23.00   23.08 5.93  7.00 33.00 26.00 -0.40
##          kurtosis   se
## age          3.24 0.60
## attitude    -0.48 0.57
## deep         0.66 0.04
## stra        -0.45 0.06
## surf        -0.27 0.04
## points      -0.26 0.46


Apart from age, the variables seem to be quite symmetrically distributed, with a slight negative skew in deep learning and exam points. By visual inspection, there might be a gender difference in attitude towards statistics, with men having a more positive attitude on average.

Attitude and exam points are moderately correlated. There is also a weak positive and a weak negative correlation between exam points and strategic learning, and between exam points and surface learning, respectively. Interestingly, deep learning is not correlated with exam points. Surface learning is also negatively correlated with deep and strategic learning as well as attitude (r = -.32 … -.16).








Regression models

Model 1

## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

I chose attitude, strategic learning, and surface learning as predictors in my regression model based on the strength of their correlations with exam points.

The F-statistic gives a test of the omnibus null hypothesis that all regression coefficients are zero. The F-statistic for Model 1 is 14.13, with a p-value less than .001. Thus, it is highly unlikely that all the regression coefficients are zero, and we can reject the null hypothesis.

The square of the multiple correlation coefficient (R^2) is .207, which signifies that the variables in the model account for about 21% of the variation in exam points.

However, the non-significant t-values of strategic and surface learning imply that attitude seems to be the only statistically significant predictor in the model (t = 5.91, p < .001). The t test tests whether a regression coefficient differs from zero.

The unstandardized regression coefficients are reported under “Estimate” in the “Coefficients” table. The coefficient .34 (p < .001) of attitude implies the strength of the relationship between attitude and exam points in the original scales of the variables, when strategic and surface learning are controlled for. However, we cannot make judgements about the relative importance of a predictor using unstandardized coefficients. We can obtain the standardized values by multiplying the raw regression coefficient by the standard deviation of the explanatory variable and dividing by the standard deviation of the response variable:

attitude: 0.34 × 7.30 / 5.89 = 0.42
stra: 0.85 × 0.77 / 5.89 = 0.11
surf: -0.59 × 0.53 / 5.89 = -0.05

The standardized beta coefficient of attitude on exam points in this model is .42. Strategic (β = .11) and surface learning (β = -.05) did not statistically significantly predict exam points.
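The standardization step can be sketched directly in R, using the unstandardized coefficients from the model summary and the standard deviations from the descriptives table above (note that surf’s SD is 0.53 there):

```r
# Standardized beta = raw coefficient * SD(predictor) / SD(outcome).
# Coefficients from Model 1; SDs from the descriptives table.
b    <- c(attitude = 0.33952, stra = 0.85313, surf = -0.58607)
sd_x <- c(attitude = 7.30,    stra = 0.77,    surf = 0.53)
sd_y <- 5.89  # SD of exam points
beta <- b * sd_x / sd_y
round(beta, 2)
```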

From the “Residuals” table, we can also see how the model residuals are distributed.

Model 2

## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In Model 2 the two non-significant predictors, strategic and surface learning, were omitted and exam points were predicted only by attitude. The null hypothesis is rejected, with a statistically significant F-statistic of 38.61 (p < .001).


Interpreting the model

The squared multiple correlation coefficient (R^2) in Model 2 is .19, which implies that attitude explains 19% of the variance in exam points. According to the t test, the regression coefficient of attitude differs from 0 (t = 6.21, p < .001). The unstandardized regression coefficient of attitude on exam points is .35 (p < .001). This implies that when attitude towards statistics increases by 1 on the original scale, there is on average a 0.35-point increase in exam score. The standardized regression coefficient is 0.44, which implies that when attitude increases by 1 SD, exam points increase by 0.44 SD. Based on the results, we can say that students with a more positive attitude on average score higher on the course exam, but the effect is relatively weak.
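The interpretation above can be illustrated with Model 2’s fitted equation (a sketch using the coefficients from the summary; the helper function name is my own):

```r
# Model 2: points = 11.63715 + 0.35255 * attitude (coefficients from the summary).
predict_points <- function(attitude) 11.63715 + 0.35255 * attitude
# Predicted exam points for three example attitude scores.
round(predict_points(c(20, 30, 40)), 1)
```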


Regression diagnostics


The regression model has several assumptions:

  1. Linear relationships
  2. Multivariate normality
  3. No or little multicollinearity (i.e., no strong correlations between predictors)
  4. Normality of residuals
  5. Homoscedasticity of residuals or equal variance
  6. The predictor variables and residuals are uncorrelated

The Residuals vs. Fitted plot shows a linear relationship between the variables. The normality of the variables was examined before the analysis via visualizations and descriptive statistics. As there is only one predictor, there cannot be multicollinearity. The standardized residuals also seem to be normally distributed, as implied by the relatively straight line in the Normal Q-Q plot. The Scale-Location plot shows that the residuals are spread equally along the range of the predictor, which implies homoscedasticity. Lastly, the Residuals vs Leverage plot shows no influential outliers.
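The four plots discussed above come from R’s plot() method for lm objects. A minimal sketch with simulated data (the variable names and numbers here are hypothetical, chosen only to resemble the real variables):

```r
set.seed(1)
attitude_sim <- rnorm(166, mean = 31.4, sd = 7.3)          # hypothetical predictor
points_sim   <- 11.6 + 0.35 * attitude_sim + rnorm(166, sd = 5.3)
fit <- lm(points_sim ~ attitude_sim)
par(mfrow = c(2, 2))
plot(fit)  # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
```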








3 Logistic regression

Task 2: The data



The data is the Student Performance Data Set from the UCI Machine Learning Repository. It describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided, regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). A full description of the original data can be found here.

The two datasets have been merged using several variables as identifiers to combine each individual student’s information: school, sex, age, address, family size, parents’ cohabitation status, parents’ education and jobs, reason to choose the school, attendance at school nursery, and home internet access. If the student had answered the same question on both questionnaires, the rounded average was calculated. If the question was non-numeric, the answer from the Mathematics performance dataset was used.

The R script about creating the merged dataset can be found here.
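A toy illustration of the merge rule described above (the mini data frames here are hypothetical; the actual script is linked):

```r
# Two tiny 'questionnaire' data frames sharing identifier columns.
mat <- data.frame(school = c("GP", "MS"), sex = c("F", "M"),
                  age = c(17, 16), absences = c(4, 2))
por <- data.frame(school = c("GP", "MS"), sex = c("F", "M"),
                  age = c(17, 16), absences = c(5, 3))
# Match students on the identifiers, then take the rounded average
# of a numeric answer that appears in both questionnaires.
both <- merge(mat, por, by = c("school", "sex", "age"),
              suffixes = c(".mat", ".por"))
both$absences <- round((both$absences.mat + both$absences.por) / 2)
both[, c("school", "sex", "age", "absences")]
```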


Variables

##  [1] X          school     sex        age        address    famsize   
##  [7] Pstatus    Medu       Fedu       Mjob       Fjob       reason    
## [13] nursery    internet   guardian   traveltime studytime  failures  
## [19] schoolsup  famsup     paid       activities higher     romantic  
## [25] famrel     freetime   goout      Dalc       Walc       health    
## [31] absences   G1         G2         G3         alc_use    high_use


Table 1 Information about variables used in analyses
Variables Information
sex student’s sex (binary: ’F’ - female or ‘M’ - male)
Medu mother’s education (numeric: 0 none, 1 primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education)
failures number of past class failures (numeric: n if 1<=n<3, else 4)
absences number of school absences (numeric: from 0 to 93)
high_use high alcohol consumption (TRUE: the average of self-reported workday and weekend alcohol consumption greater than 2 on a scale 1 -very low - 5 very high, FALSE: the average 2 or lower)


Let’s take a glimpse of the structure and dimensions of the subset of data we are using:


## Observations: 382
## Variables: 5
## $ sex      <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,...
## $ Medu     <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ G3       <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...


As Medu is truly a categorical variable, not a numerical one, it will have to be recoded into a factor.
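A sketch of that recoding (the example values are taken from the glimpse above, and the label set follows Table 1):

```r
# Medu comes in as integers 0-4; recode it into a labelled factor.
Medu_int <- c(4, 1, 1, 4, 3)
Medu <- factor(Medu_int, levels = 0:4,
               labels = c("None", "Primary", "5th to 9th",
                          "Secondary", "Higher"))
Medu
```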









Task 3: Hypotheses

The purpose of my analysis is to study the relationships between high/low alcohol consumption and students’ demographic, social, and school-related characteristics.

I chose the following variables to explain students’ alcohol consumption: student’s sex, mother’s education, class failures, and absences.

  1. Student’s sex: Boys and young men usually consume more alcohol compared to girls or young women.
  2. Mother’s education: Parents’ education has been found to be a significant predictor of different social outcomes and well-being. Families with higher education generally require less support from society. Differences in education level might also reflect differences in parents’ own behavior, values, and parental support.
  3. Failure: Difficulties stack up and failures might lead to disaffection towards school, which in turn might lead to valuing other activities and social circles that accept or encourage alcohol consumption.
  4. Absences: Absences might be indications of disaffection and negative attitudes towards school, or other problems in life.








Task 4: Exploring the data


Table 2 Descriptives of chosen variables
vars n mean sd median trimmed mad min max range skew kurtosis se
sex* 1 382 1.482 0.500 1 1.477 0.000 1 2 1 0.073 -2.000 0.026
Medu* 2 382 3.806 1.086 4 3.892 1.483 1 5 4 -0.384 -1.037 0.056
failures 3 382 0.202 0.583 0 0.033 0.000 0 3 3 3.034 8.689 0.030
absences 4 382 4.500 5.463 3 3.536 2.965 0 45 45 3.187 16.186 0.279
G3 5 382 11.458 3.310 12 11.631 2.965 0 18 18 -0.459 0.180 0.169


Table 3 Students’ sex
sex n
F 198
M 184

Table 4 High/low alcohol consumption by students’ sex crosstabulated
high_use F M
FALSE 156 112
TRUE 42 72


Table 5 High/low alcohol consumption by mother’s education crosstabulated
high_use None Primary 5th to 9th Secondary Higher
FALSE 1 33 80 59 95
TRUE 2 18 18 36 40

Table 6 High/low alcohol consumption by number of class failures crosstabulated
high_use 0 1 2 3
FALSE 244 12 10 2
TRUE 90 12 9 3

Table 7 Mean number of absences by high/low alcohol consumption and sex
sex high_use count mean_absences
F FALSE 156 4.22
F TRUE 42 6.79
M FALSE 112 2.98
M TRUE 72 6.12

Table 8 Mean grade by high/low alcohol consumption and sex
sex high_use count mean_grade
F FALSE 156 11.397
F TRUE 42 11.714
M FALSE 112 12.205
M TRUE 72 10.278








Task 5: Logistic regression

## 
## Call:
## glm(formula = high_use ~ Medu + failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3466  -0.8419  -0.6057   1.0340   2.3030  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.07693    1.26355   0.061   0.9514    
## MeduPrimary    -1.67979    1.29798  -1.294   0.1956    
## Medu5th to 9th -2.65576    1.29394  -2.052   0.0401 *  
## MeduSecondary  -1.72156    1.28764  -1.337   0.1812    
## MeduHigher     -2.00750    1.28668  -1.560   0.1187    
## failures        0.43321    0.20127   2.152   0.0314 *  
## absences        0.09627    0.02427   3.967 7.28e-05 ***
## sexM            0.97945    0.24932   3.928 8.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 412.83  on 374  degrees of freedom
## AIC: 428.83
## 
## Number of Fisher Scoring iterations: 4


As Medu does not consistently predict high alcohol consumption well (the only significant coefficient being the class “5th to 9th”, p = .040), I omitted the variable from the model. This also makes the model easier to interpret.


## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1855  -0.8371  -0.6000   1.1020   2.0209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.90297    0.22626  -8.411  < 2e-16 ***
## failures     0.45082    0.18992   2.374 0.017611 *  
## absences     0.09322    0.02295   4.063 4.85e-05 ***
## sexM         0.94117    0.24200   3.889 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 378  degrees of freedom
## AIC: 432.4
## 
## Number of Fisher Scoring iterations: 4
Table 9 Odds ratios and their confidence intervals
OR 2.5 % 97.5 %
(Intercept) 0.149 0.094 0.229
failures 1.570 1.083 2.295
absences 1.098 1.052 1.151
sexM 2.563 1.604 4.149

As we can see from the model summary, the intercept of high alcohol consumption is -1.90, which is more than 8 standard errors (the z value, or Wald test statistic) below 0, with a statistically significant p < .001. The slope coefficient of, for example, absences is 0.093. This means that for a one-point increase in absences, the log-odds of high alcohol consumption increases by 0.09. The z values of the coefficients of failures, absences, and sex are positive and over 2 standard errors away from 0, and are statistically significant with p < .05 for failures and p < .001 for absences and sex.
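The odds ratios tabulated below are just the exponentiated logit coefficients, which can be checked by hand (coefficients copied from the second model’s summary):

```r
# Odds ratio = exp(logistic regression coefficient).
b <- c(intercept = -1.90297, failures = 0.45082,
       absences  =  0.09322, sexM     = 0.94117)
round(exp(b), 3)
```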

From the odds ratios we can see that, when the effects of the other predictor variables are taken into account…

  • The odds of high alcohol consumption increase by about 8% to 130% with each class failure.
  • The odds of high alcohol consumption increase by about 5% to 15% with every absence.
  • The odds of male students consuming high amounts of alcohol are about one-and-a-half to four times the odds of female students.

Conclusion: As expected, class failures, school absences, and student’s sex predict higher alcohol consumption. Male students are more likely to be high alcohol consumers, and class failures and absences also increase the probability of higher consumption. However, mother’s education doesn’t seem to predict alcohol use consistently.









Task 6: Predictive power

##           Medu failures absences sex high_use probability prediction
## 373    Primary        1        0   M    FALSE  0.45259271      FALSE
## 374     Higher        1        7   M     TRUE  0.53891660       TRUE
## 375 5th to 9th        0        1   F    FALSE  0.07708996      FALSE
## 376     Higher        0        6   F    FALSE  0.20538904      FALSE
## 377 5th to 9th        1        2   F    FALSE  0.12421781      FALSE
## 378  Secondary        0        2   F    FALSE  0.18968092      FALSE
## 379    Primary        2        2   F    FALSE  0.36728009      FALSE
## 380    Primary        0        3   F    FALSE  0.21181023      FALSE
## 381  Secondary        0        4   M     TRUE  0.43043082      FALSE
## 382  Secondary        0        2   M     TRUE  0.38399298      FALSE
##         prediction
## high_use FALSE TRUE
##    FALSE   257   11
##    TRUE     78   36

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67277487 0.02879581 0.70157068
##    TRUE  0.20418848 0.09424084 0.29842932
##    Sum   0.87696335 0.12303665 1.00000000
## [1] 0.2329843


The average proportion of wrong predictions in the data is 23% using student’s sex, absences, and class failures as predictors. This means that the prediction was right about three times out of four, which is clearly better than random guessing (an expected error rate of 50%). The model was especially accurate at predicting low alcohol consumption, predicting correctly 257 times out of 268. However, the model misclassified most of the cases where alcohol consumption was categorized as high.
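The 23% figure can be recomputed from the cross-tabulation above; the off-diagonal counts are the wrong predictions:

```r
# Confusion matrix counts copied from the output above.
conf <- matrix(c(257, 11,
                  78, 36),
               nrow = 2, byrow = TRUE,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
error_rate <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
round(error_rate, 3)
```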








4 Clustering and classification

Task 2: The data

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.



The Boston dataset (Harrison & Rubinfeld, 1978; Belsley, Kuh, & Welsch, 1980) from the MASS package is about housing values in the suburbs of Boston. Information about the dataset can be found here. The dataset contains the following variables:


Table 1 Variables in the dataset
Variables Information
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox nitrogen oxides concentration (parts per 10 million).
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
ptratio pupil-teacher ratio by town.
black 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town.
lstat lower status of the population (percent).
medv median value of owner-occupied homes in $1,000s.
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...


The dataset has 506 observations and 14 variables.


Original source:

Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.



Task 3: Exploring the data

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

Table 2 Descriptive statistics
vars n mean sd median trimmed mad min max range skew kurtosis se
crim 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97 5.19 36.60 0.38
zn 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00 2.21 3.95 1.04
indus 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28 0.29 -1.24 0.30
chas 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00 3.39 9.48 0.01
nox 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49 0.72 -0.09 0.01
rm 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22 0.40 1.84 0.03
age 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10 -0.60 -0.98 1.25
dis 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00 1.01 0.46 0.09
rad 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00 1.00 -0.88 0.39
tax 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00 0.67 -1.15 7.49
ptratio 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40 -0.80 -0.30 0.10
black 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58 -2.87 7.10 4.06
lstat 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24 0.90 0.46 0.32
medv 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00 1.10 1.45 0.41

Variables crim and zn are considerably positively skewed with a strong floor effect, whereas black is negatively skewed with a strong ceiling effect. Variables indus, rad, and tax seem to be bimodal, or have gaps in their histograms followed by peaks at high values.


Table 3 Correlations
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29 -0.39 0.46 -0.39
zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39 0.18 -0.41 0.36
indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38 -0.36 0.60 -0.48
chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12 0.05 -0.05 0.18
nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19 -0.38 0.59 -0.43
rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36 0.13 -0.61 0.70
age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26 -0.27 0.60 -0.38
dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23 0.29 -0.50 0.25
rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46 -0.44 0.49 -0.38
tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46 -0.44 0.54 -0.47
ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00 -0.18 0.37 -0.51
black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18 1.00 -0.37 0.33
lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37 -0.37 1.00 -0.74
medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51 0.33 -0.74 1.00

Correlogram showing correlations with p < .05


The correlation matrix shows that:

The crime rate of the suburb correlates moderately with accessibility to radial highways (r = .63) and property tax rate (r = .58).

Crime is also correlated with low-to-moderate positive associations (.30 < r < .50) with

  • indus (proportion of non-retail business)
  • nox (nitrogen oxides concentration)
  • age (proportion of owner-occupied units built prior to 1940)
  • lstat (lower status of the population)

…and low-to-moderate negative associations (-.30 > r > -.50) with

  • dis (distances to employment centres)
  • black (the proportion of blacks by town)
  • medv (median value of owner-occupied homes)


Some other notable relationships:

  • tax (property tax rate) is very strongly correlated (r = .91) with rad (accessibility to radial highways) and strongly correlated (r = .72) with indus (proportion of non-retail business)
  • dis (distances to employment centers) is strongly negatively correlated (-.71 > r > -.77) with indus (proportion of non-retail business), nox (nitrogen oxides concentration), and age (units built prior to 1940)
  • lstat (lower status of the population) is strongly negatively correlated (r = -.74) with medv (median value of homes)



Task 4: Forming the train and test datasets

Standardize the dataset and print out summaries of the scaled data.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865


How did the variables change?

All the variables have a mean of 0 and a standard deviation of 1.
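The standardization itself is a single call to scale(); a sketch using the Boston data from MASS:

```r
library(MASS)  # provides the Boston dataset
boston_scaled <- as.data.frame(scale(Boston))
# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```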


Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.


Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
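A sketch of both steps, quantile-based crime categories and the 80/20 split (the seed is my own choice; floor() keeps the sample size an integer):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))
# Quantile break points turn the scaled crim into a 4-level factor.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim  <- NULL    # drop the old crime rate variable
boston_scaled$crime <- crime
# 80/20 split into train and test sets.
set.seed(2019)                 # hypothetical seed for reproducibility
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(train = nrow(train), test = nrow(test))
```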



Task 5: Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2500000 0.2648515 0.2475248 
## 
## Group means:
##                  zn      indus         chas        nox          rm
## low       0.8716524 -0.8803768 -0.149294685 -0.8533447  0.40762922
## med_low  -0.1306383 -0.2367523  0.039520456 -0.5457531 -0.17324299
## med_high -0.3822513  0.1646256  0.242805543  0.3702445  0.03639424
## high     -0.4872402  1.0149946  0.003267949  1.0114540 -0.40974575
##                 age        dis        rad        tax     ptratio
## low      -0.8473887  0.8291863 -0.6899692 -0.7167540 -0.43703575
## med_low  -0.3511617  0.3105986 -0.5509118 -0.4585735 -0.07367113
## med_high  0.4106292 -0.3762669 -0.4001244 -0.3092791 -0.21732042
## high      0.8031266 -0.8369907  1.6596029  1.5294129  0.80577843
##               black       lstat        medv
## low       0.3849264 -0.76235175  0.47020076
## med_low   0.3149096 -0.12160794 -0.03672426
## med_high  0.1173919  0.06650001  0.09642381
## high     -0.8159334  0.86326616 -0.61278053
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.14685329  0.695866148 -0.98664839
## indus    0.02501573 -0.225436324  0.36498421
## chas    -0.06668855 -0.080607728  0.12893709
## nox      0.30382933 -0.799350373 -1.36308034
## rm      -0.12793383 -0.088045625 -0.15648759
## age      0.36162750 -0.217999587 -0.27431668
## dis     -0.05256980 -0.190727776  0.09791413
## rad      3.34710726  0.786005653 -0.13013548
## tax     -0.15355790  0.173955492  0.63886266
## ptratio  0.18126338 -0.007151421 -0.33725208
## black   -0.15657349 -0.031270869  0.08722478
## lstat    0.13959963 -0.319265385  0.32399903
## medv     0.17511809 -0.358315838 -0.22166969
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9514 0.0357 0.0129
Table 4 Group means of the scaled variables by crime category
zn indus chas nox rm age dis rad tax ptratio black lstat medv
low 0.87 -0.88 -0.15 -0.85 0.41 -0.85 0.83 -0.69 -0.72 -0.44 0.38 -0.76 0.47
med_low -0.13 -0.24 0.04 -0.55 -0.17 -0.35 0.31 -0.55 -0.46 -0.07 0.31 -0.12 -0.04
med_high -0.38 0.16 0.24 0.37 0.04 0.41 -0.38 -0.40 -0.31 -0.22 0.12 0.07 0.10
high -0.49 1.01 0.00 1.01 -0.41 0.80 -0.84 1.66 1.53 0.81 -0.82 0.86 -0.61
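The fit above can be reproduced in outline as follows (a sketch; the seed is my own choice, so the exact numbers will differ from the output shown):

```r
library(MASS)
set.seed(2019)  # hypothetical seed
boston_scaled <- as.data.frame(scale(Boston))
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim  <- NULL
boston_scaled$crime <- crime
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
# LDA with the crime category as target and all other variables as predictors.
lda_fit <- lda(crime ~ ., data = train)
plot(lda_fit, dimen = 2)   # the LDA (bi)plot
```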



Task 6: Predicting with the model

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset.


Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

Table 5 Correct and predicted crime classes cross-tabulated (rows: correct class, columns: predicted class)
         low med_low med_high high
low       20       9        2    0
med_low    6      11        8    0
med_high   0       2       17    0
high       0       0        1   26
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low med_low med_high high
##   low       20       6        0    0
##   med_low    9      11        2    0
##   med_high   2       8       17    1
##   high       0       0        0   26
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7255          
##                  95% CI : (0.6282, 0.8092)
##     No Information Rate : 0.3039          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6345          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: med_low Class: med_high Class: high
## Sensitivity              0.6452         0.4400          0.8947      0.9630
## Specificity              0.9155         0.8571          0.8675      1.0000
## Pos Pred Value           0.7692         0.5000          0.6071      1.0000
## Neg Pred Value           0.8553         0.8250          0.9730      0.9868
## Prevalence               0.3039         0.2451          0.1863      0.2647
## Detection Rate           0.1961         0.1078          0.1667      0.2549
## Detection Prevalence     0.2549         0.2157          0.2745      0.2549
## Balanced Accuracy        0.7803         0.6486          0.8811      0.9815


Overall, the prediction accuracy of the model is about 73 % (this changes when the code is rerun, because the test data is sampled randomly each time), which is quite high for a four-class problem. The predictions are most accurate for the higher crime categories: sensitivity is 0.96 for high and 0.89 for med_high, but only 0.44 for med_low.



Task 7: K-Means clustering

Reload the Boston dataset and standardize the dataset.


Calculate the distances between the observations.
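A sketch of the reload, standardize, and distance steps; the two summaries below suggest both Euclidean and Manhattan distances were computed:

```r
library(MASS)  # for the Boston data

# Reload and standardize: scale() centers each column and divides by its sd
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Pairwise distances between observations
dist_eu  <- dist(boston_scaled)                        # Euclidean (the default)
dist_man <- dist(boston_scaled, method = "manhattan")  # Manhattan
summary(dist_eu)
summary(dist_man)
```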

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
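The WCSS elbow search and the final fit might look like this (a sketch; the seed and the range of k are my choices, not necessarily what was run):

```r
set.seed(123)  # k-means starts from random centers, so fix the seed (my choice)

# Total within-cluster sum of squares (WCSS) for 1..10 clusters
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)
plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters", ylab = "Total WCSS")  # look for the "elbow"

# Refit with two clusters and color the scatterplot matrix by cluster
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
```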

## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero


The most radical change (the “elbow”) in the total within-cluster sum of squares (WCSS) happens between one and two clusters. Although WCSS keeps decreasing as clusters are added, the changes become increasingly small. Therefore, I chose a two-cluster solution.

The clusters differ most clearly in crime rate, with cluster 2 showing virtually no crime. The clusters also differ in the proportion of non-retail business, nitrogen oxide concentration, proportion of old buildings, distance to employment centers, accessibility of highways, property tax, and pupil-teacher ratio.









5 Dimension reduction

Task 1: Show a graphical overview of the data

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.



The original data is from the United Nations’ Human Development Reports. The data combines several indicators from most countries in the world. Our modified analysis dataset has the following variables:


Table 1 Variables in the dataset

| Variable  | Information |
|-----------|-------------|
| edu2FM    | Proportion of females with at least secondary education divided by the proportion of males with at least secondary education |
| labFM     | Proportion of females in the labour force divided by the proportion of males in the labour force |
| edu.exp   | Expected years of schooling |
| life.exp  | Life expectancy at birth |
| gni       | Gross National Income per capita |
| mat.mor   | Maternal mortality ratio |
| ado.birth | Adolescent birth rate |
| parlF     | Proportion of female representatives in parliament |

(Script for data modification can be found here.)
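A sketch of how the overview below might be produced, assuming the modified dataset is available as an object named `human` (the actual chunk is hidden):

```r
library(dplyr)
library(GGally)

glimpse(human)        # structure and dimensions
summary(human)        # five-number summaries
ggpairs(human)        # distributions on the diagonal, pairwise relationships off it
round(cor(human), 2)  # correlation matrix
```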


## Observations: 155
## Variables: 8
## $ edu2FM    <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ labFM     <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ edu.exp   <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ life.exp  <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ gni       <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 529...
## $ mat.mor   <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ ado.birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ parlF     <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
##      edu2FM           labFM           edu.exp         life.exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni            mat.mor         ado.birth          parlF      
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
Table 2 Descriptives

| Variable  | vars |   n |     mean |       sd |   median |  trimmed |      mad |    min |       max |     range |  skew | kurtosis |      se |
|-----------|-----:|----:|---------:|---------:|---------:|---------:|---------:|-------:|----------:|----------:|------:|---------:|--------:|
| edu2FM    |    1 | 155 |     0.85 |     0.24 |     0.94 |     0.87 |     0.12 |   0.17 |      1.50 |      1.33 | -0.76 |     0.55 |    0.02 |
| labFM     |    2 | 155 |     0.71 |     0.20 |     0.75 |     0.73 |     0.17 |   0.19 |      1.04 |      0.85 | -0.87 |     0.05 |    0.02 |
| edu.exp   |    3 | 155 |    13.18 |     2.84 |    13.50 |    13.24 |     2.97 |   5.40 |     20.20 |     14.80 | -0.20 |    -0.34 |    0.23 |
| life.exp  |    4 | 155 |    71.65 |     8.33 |    74.20 |    72.40 |     7.56 |  49.00 |     83.50 |     34.50 | -0.76 |    -0.15 |    0.67 |
| gni       |    5 | 155 | 17627.90 | 18543.85 | 12040.00 | 14552.58 | 13337.47 | 581.00 | 123124.00 | 122543.00 |  2.14 |     6.83 | 1489.48 |
| mat.mor   |    6 | 155 |   149.08 |   211.79 |    49.00 |   104.70 |    63.75 |   1.00 |   1100.00 |   1099.00 |  2.03 |     4.16 |   17.01 |
| ado.birth |    7 | 155 |    47.16 |    41.11 |    33.60 |    41.62 |    35.73 |   0.60 |    204.80 |    204.20 |  1.13 |     0.89 |    3.30 |
| parlF     |    8 | 155 |    20.91 |    11.49 |    19.30 |    20.32 |    11.42 |   0.00 |     57.50 |     57.50 |  0.55 |    -0.10 |    0.92 |


Most of the variables are quite symmetrically distributed. However, GNI and maternal mortality are strongly positively skewed, and there is also some positive skew in adolescent birth rate. Life expectancy is correlated with maternal mortality (r = −.86), expected education (r = .79), adolescent birth rate (r = −.73), GNI (r = .63), and education ratio (edu2FM; r = .58). Expected education has similar correlations with the other variables. GNI is also correlated with life expectancy (r = .63), expected education (r = .62), education ratio (r = .58), adolescent birth rate (r = −.56), and maternal mortality (r = −.50).



Task 2: Principal component analysis

Perform principal component analysis (PCA) on the non-standardized human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables.
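A sketch of the unstandardized PCA and its biplot (again assuming the data object is called `human`; the `cex` sizes are my choices):

```r
# PCA on the raw variables, i.e. principal components of the covariance matrix
pca_human <- prcomp(human)
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", ], 2)  # % of variance per PC

# Biplot: observations as points, original variables as arrows
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.8))
```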

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


Task 3: PCA with standardized variables

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results using not just your variable names but the actual phenomena they relate to.
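A sketch of the standardized version; `prcomp(human, scale. = TRUE)` would give the same components in a single call:

```r
# Standardize to zero mean and unit variance, then repeat the PCA
human_std <- as.data.frame(scale(human))
summary(human_std)

pca_std <- prcomp(human_std)  # now PCs of the correlation matrix
summary(pca_std)
biplot(pca_std, choices = 1:2, cex = c(0.6, 0.8))
```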

##      edu2FM            labFM            edu.exp           life.exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       gni             mat.mor          ado.birth           parlF        
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 53.61 16.24  9.57  7.58  5.48  3.60  2.63  1.30


GNI dominates the components in the PCA with unstandardized variables, and the first principal component accounts for almost 100 % of the total variance of the observed variables, whereas the results of the PCA with standardized variables make much more sense. The difference is due to the standardization. More specifically, when the principal components are extracted from the covariance matrix (unstandardized data), the results depend on the units of measurement, and large differences between the variances of the original variables dominate the solution. Extracting the principal components from the correlation matrix (variables standardized to unit variance) makes the variables “equally important”.

The biplot of the standardized PCA shows that parlF and labFM load mostly on the second component (PC2), which accounts for 16 % of the total variance, while the other variables load on the first component (PC1), which accounts for 54 % of the variance.


Task 4: Interpreting the results

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.


The first component seems to be a measure of general well-being indicators, such as GNI, life expectancy, education, mortality, and birth rate, whereas the second component seems to capture variance in gender-equality indicators, namely the proportion of female representatives in parliament and the female-to-male ratio in the labour force.



Task 5: Multiple Correspondence Analysis

Load the tea dataset from the FactoMineR package. Explore the data briefly: look at the structure and dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or certain columns of it, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.

The Tea dataset

300 tea consumers answered a survey about how they drink tea, their perception of the product, and some personal details. Except for age, all the variables are categorical. For age, the dataset has two different variables: a continuous one and a categorical one.
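Loading and exploring the data might look like this (a sketch):

```r
library(FactoMineR)
library(dplyr)

data("tea")    # ships with FactoMineR: 300 respondents, 36 variables
glimpse(tea)
summary(tea)
```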

## Observations: 300
## Variables: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always           <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home             <fct> home, home, home, home, home, home, home, hom...
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How              <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where            <fct> chain store, chain store, chain store, chain ...
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC              <fct> middle, middle, other worker, student, employ...
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 


## Warning: attributes are not identical across measure variables;
## they will be dropped


Selecting some variables for analysis:

## Observations: 300
## Variables: 6
## $ Tea      <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G...
## $ How      <fct> alone, milk, alone, alone, alone, alone, alone, milk,...
## $ how      <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,...
## $ sugar    <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,...
## $ where    <fct> chain store, chain store, chain store, chain store, c...
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N...
## 
## Call:
## MCA(X = tea_, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.283   0.268   0.216   0.186   0.173   0.164
## % of var.             15.455  14.611  11.777  10.169   9.417   8.956
## Cumulative % of var.  15.455  30.066  41.843  52.013  61.430  70.386
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.133   0.116   0.087   0.062
## % of var.              7.880   7.262   6.345   4.735   3.393
## Cumulative % of var.  78.265  85.527  91.872  96.607 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.453  0.242  0.169 |  0.235  0.069  0.046 | -0.251
## 2                  | -0.310  0.113  0.056 |  0.045  0.003  0.001 | -0.672
## 3                  | -0.316  0.117  0.145 | -0.030  0.001  0.001 | -0.303
## 4                  | -0.696  0.571  0.609 |  0.208  0.054  0.055 |  0.270
## 5                  | -0.505  0.300  0.329 |  0.189  0.044  0.046 | -0.189
## 6                  | -0.505  0.300  0.329 |  0.189  0.044  0.046 | -0.189
## 7                  | -0.316  0.117  0.145 | -0.030  0.001  0.001 | -0.303
## 8                  | -0.121  0.017  0.009 | -0.174  0.038  0.018 | -0.785
## 9                  |  0.495  0.288  0.133 | -0.840  0.879  0.383 |  0.058
## 10                 |  0.597  0.420  0.197 | -0.424  0.224  0.100 | -0.387
##                       ctr   cos2  
## 1                   0.097  0.052 |
## 2                   0.697  0.261 |
## 3                   0.141  0.133 |
## 4                   0.113  0.092 |
## 5                   0.055  0.046 |
## 6                   0.055  0.046 |
## 7                   0.141  0.133 |
## 8                   0.952  0.375 |
## 9                   0.005  0.002 |
## 10                  0.231  0.083 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.556   4.493   0.101   5.506 |  -0.084   0.109
## Earl Grey          |  -0.221   1.842   0.088  -5.124 |  -0.167   1.116
## green              |   0.043   0.012   0.000   0.259 |   1.166   9.301
## alone              |  -0.088   0.294   0.014  -2.067 |   0.207   1.735
## lemon              |   0.722   3.371   0.064   4.388 |  -0.103   0.073
## milk               |  -0.241   0.720   0.015  -2.152 |  -0.322   1.354
## other              |   0.944   1.572   0.028   2.870 |  -1.856   6.432
## tea bag            |  -0.670  14.981   0.588 -13.257 |   0.081   0.233
## tea bag+unpackaged |   0.642   7.594   0.188   7.498 |  -0.771  11.595
## unpackaged         |   1.490  15.667   0.303   9.513 |   1.630  19.843
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.002  -0.835 |  -1.011  19.447   0.334 -10.000 |
## Earl Grey            0.050  -3.877 |   0.441   9.676   0.351  10.251 |
## green                0.168   7.087 |  -0.315   0.844   0.012  -1.917 |
## alone                0.080   4.880 |  -0.179   1.612   0.060  -4.223 |
## lemon                0.001  -0.626 |   1.589  21.447   0.312   9.661 |
## milk                 0.028  -2.870 |  -0.073   0.085   0.001  -0.647 |
## other                0.107  -5.645 |  -1.436   4.773   0.064  -4.366 |
## tea bag              0.009   1.606 |  -0.123   0.661   0.020  -2.431 |
## tea bag+unpackaged   0.271  -9.008 |   0.175   0.739   0.014   2.042 |
## unpackaged           0.362  10.410 |   0.124   0.142   0.002   0.791 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.108 0.169 0.388 |
## How                | 0.101 0.154 0.362 |
## how                | 0.650 0.509 0.020 |
## sugar              | 0.093 0.001 0.410 |
## where              | 0.658 0.660 0.091 |
## tea.time           | 0.090 0.114 0.025 |
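The call shown in the output above is `MCA(X = tea_, graph = FALSE)`; a sketch of the column selection and the variable biplot (the column names are read off the glimpse above, the plotting options are my choices):

```r
# Keep the six columns analyzed above
keep <- c("Tea", "How", "how", "sugar", "where", "tea.time")
tea_ <- dplyr::select(tea, all_of(keep))

mca <- MCA(tea_, graph = FALSE)
summary(mca)

# Variable biplot: category points on the first two dimensions,
# colored by the variable they belong to
plot(mca, invisible = "ind", habillage = "quali")
```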

The first dimension accounts for 15.5 % and the second dimension for 14.6 % of the total inertia. The first dimension seems to relate to how the tea is packaged and where it is bought: at one end are tea bags and tea from chain stores, at the other unpackaged tea and tea shops. The second dimension seems to be about the blend and what the tea is taken with: green tea and no additions sit in the upper half of the figure, black tea, Earl Grey, milk, and lemon are in the middle, and the choice “other” is at the bottom.